Spelling Suggestion
   HOME

TheInfoList



OR:

Spelling suggestion is a feature of many
computer software Software is a set of computer programs and associated documentation and data. This is in contrast to hardware, from which the system is built and which actually performs the work. At the lowest programming level, executable code consists ...
applications used to suggest plausible replacements for words that are likely to have been misspelled. ''Spelling suggestion'' features are commonly included in
Internet The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a '' network of networks'' that consists of private, pub ...
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
s,
word processor A word processor (WP) is a device or computer program that provides for input, editing, formatting, and output of text, often with some additional features. Word processor (electronic device), Early word processors were stand-alone devices ded ...
s,
spell checker In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic di ...
s,
medical transcription Medical transcription, also known as MT, is an allied health profession dealing with the process of transcribing voice-recorded medical reports that are dictated by physicians, nurses and other healthcare practitioners. Medical reports can be vo ...
, automatic query reformulation, and frequency-log statistics reporting.


Algorithms

Any spell checker must have some data about the words in the target language, either in general usage or with specialized knowledge (like medical vocabulary). This can come from: * A
dictionary A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by radical and stroke for ideographic languages), which may include information on definitions, usage, etymologies ...
of all known words. * A
text corpus In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, they are used to do statistical a ...
which includes typical text, known to be correctly spelled. * A list of frequently misspelled words, mapping errors to corrections. * Logs of human text input, such as from a popular
search engine A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
. This is essentially a
crowdsourced Crowdsourcing involves a large group of dispersed participants contributing or producing goods or services—including ideas, votes, micro-tasks, and finances—for payment or as volunteers. Contemporary crowdsourcing often involves digita ...
corpus, but it is assumed there will be some spelling mistakes. Data might be included about when people click on a spelling suggestion or make a second, very similar query; this creates a crowdsourced mapping of misspelled words to reliable corrections. A list of frequently misspelled words, possibly including multi-word phrases, can simply be consulted to see if any of the input words or phrases are listed. To make use of a dictionary without a pre-existing mapping from misspellings to corrections, the typical technique is to calculate the
edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to tr ...
between an input word and any given word in the dictionary. The
Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-charact ...
metric considers an "edit" to be the insertion, deletion, or substitution (with another letter) of one letter. The
Damerau–Levenshtein distance In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein.) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Lev ...
adds transpositions (the swapping of neighboring letters). Dictionary words that are an edit distance of 1 away from the input word are considered highly likely as corrections, edit distance 2 less likely, and edit distance 3 sometimes included in suggestions and sometimes ignored. A text corpus can be summed up as a dictionary of known words, with a frequency of appearance for each word. This can be used to sort the spelling suggestions. For example, if there are multiple suggestions of edit distance 1, the words that appear most frequently in the corpus are most likely to be the desired correction. Because a dictionary of known words is very large, calculating the edit distance between an input word and every word in the dictionary is computationally intensive and thus relatively slow. Various
data structure In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, a ...
s can be utilized to speed up storage lookups, such as BK-trees. A faster approach adopted by Peter NorvigHow to Write a Spelling Corrector
/ref> generates all the
permutations In mathematics, a permutation of a set is, loosely speaking, an arrangement of its members into a sequence or linear order, or if the set is already ordered, a rearrangement of its elements. The word "permutation" also refers to the act or proc ...
from an input word of all possible edits. For a word of length ''n'' and an alphabet of size ''a'', for edit distance 1 there are at most ''n'' deletions, ''n-1'' transpositions, ''a*n'' alterations, and ''a*(n+1)'' insertions.1000x Faster Spelling Correction algorithm (2012)
/ref> Using only the 26 letters in the
English alphabet The alphabet for Modern English is a Latin-script alphabet consisting of 26 letters, each having an upper- and lower-case form. The word ''alphabet'' is a compound of the first two letters of the Greek alphabet, '' alpha'' and '' beta''. ...
, this would produce only ''54*n+25'' dictionary lookups, minus any duplicates (which depends on the specific letters in the word). This is relatively small when compared to a dictionary of hundreds of thousands of words. However, tens or hundreds of thousands of lookups might be required for edit distance 2 and greater. A further innovation adopted by Wolf Garbe, known as SymSpell ("sym" as in "symmetry") speeds the input-time calculation by utilizing the fact that only permutations involving deletions need to be generated for input words if the same deletion permutations are per-calculated on the dictionary. The algorithms described so far do not deal well with correct words that are not in the dictionary. Common sources of unknown words in English are
compound word In linguistics, a compound is a lexeme (less precisely, a word or sign) that consists of more than one stem. Compounding, composition or nominal composition is the process of word formation that creates compound lexemes. Compounding occurs when ...
s and
inflection In linguistic morphology, inflection (or inflexion) is a process of word formation in which a word is modified to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, mood, animacy, and defin ...
s, such as ''-s'' and ''-ing''. These can be accommodated algorithmically, especially if the dictionary contains the
part of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assi ...
. These algorithms have also assumed that all errors of a given distance are equally probable, which is not true. Errors that involve spelling
phonetically Phonetics is a branch of linguistics that studies how humans produce and perceive sounds, or in the case of sign languages, the equivalent aspects of sign. Linguists who specialize in studying the physical properties of speech are phoneticians. ...
where
English orthography English orthography is the writing system used to represent spoken English, allowing readers to connect the graphemes to sound and to meaning. It includes English's norms of spelling, hyphenation, capitalisation, word breaks, emphasis, and p ...
is not phonetic are common, as are errors that repeat the same letter or confuse neighboring letters on a
QWERTY keyboard QWERTY () is a keyboard layout for Latin-script alphabets. The name comes from the order of the first six keys on the top left letter row of the keyboard ( ). The QWERTY design is based on a layout created for the Sholes and Glidden t ...
. If a large set of known spelling errors and corrections is available, this data can be used to generate frequency tables for letter pairs and edit types; these can be used to more accurately rank suggestions. It is also more common than chance for a word to be spelled in the wrong dialect compared to the rest of the text, for example due to
American and British English spelling differences Despite the various List of dialects of English, English dialects spoken from country to country and within different regions of the same country, there are only slight regional variations in English orthography, the two most notable variatio ...
. Spelling suggestions can also be made more accurate by taking into account more than one word at a time. Multi-word sequences are known as
n-gram In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called Q-gram) is a contiguous sequence of ''n'' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or b ...
s (where ''n'' is the number of words in the sequence). A very large database of n-grams up to 5 words in length is available from Google for this and other purposes. Others have experimented with using large amounts of data and
deep learning Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised. De ...
techniques (a form of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
to train
neural networks A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
to perform spelling correction.Algorithms and techniques for spell checking
/ref>


References

Internet search engines Spell checkers